Inferring Biologically Relevant Models: Nested Canalyzing 

Functions* 



Inferring dynamic biochemical networks is one of the main challenges in systems biology. Given 
experimental data, the objective is to identify the rules of interaction among the different entities of 
the network. However, the number of possible models fitting the available data is huge and identify- 
ing a biologically relevant model is of great interest. Nested canalyzing functions, where variables in 
a given order dominate the function, have recently been proposed as a framework for modeling gene 
regulatory networks. Previously we described this class of functions as an algebraic toric variety. In 
this paper, we present an algorithm that identifies all nested canalyzing models that fit the given data. 
We demonstrate our methods using a well-known Boolean model of the cell cycle in budding yeast. 

1 Introduction 

Inferring dynamic biochemical networks is one of the main challenges in systems biology. Many math- 
ematical and statistical methods, within different frameworks, have been developed to address this prob- 
lem, see lf2TTl for a review of some of these methods. Starting from experimental data and known bi- 
ological properties only, the idea is to infer a "most likely" model that could be used to generate the 
experimental data. Here the model could have two parts. The first one is the static network which is a 
directed graph showing the influence relationships among the components of the network, where an edge 
from node y to node x implies that changes in the concentration of y could change the concentration of 
x. The other part of the model is the dynamics of the network, which describes how exactly the con- 
centration of x is affected by that of y. Due to the fact that biological networks are not well-understood 
and the available data about the network is usually limited, many models end up fitting the available 
information and the criteria for choosing a particular model are usually not biologically motivated but 
rather a consequence of the modeling framework. 

A framework that has long been used for modeling gene regulatory networks is time-discrete, finite- 
space dynamical systems. This includes Boolean networks [13], Logical models [24], Petri nets E21 . and 
algebraic models lfl4l . The latter is a straightforward generalization of Boolean networks to multistate 
systems. Furthermore, in ||25l . it was shown that logical models as well as Petri nets could be viewed and 
analyzed as algebraic models. The inference methods we develop here are within the algebraic models 
framework. To be self-contained, we briefly describe this framework and state some of the known results 
that we need in this paper, see |[T4l [TOl l8l l25l for more details. Throughout this paper, we will be talking 
about gene regulatory networks, however, the methods apply for biochemical networks in general. 

Suppose that the gene regulatory network that we want to infer has n genes and that we have a set 
D of r state transition pairs (sy,ty), j = 1, ...,r. The input Sj and the output tj are ^-tuples of and 1 
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encoding the state of genes xi, . ■ . ,x n . Real time data points are not Boolean but could be discretized 
(and in particular, could be made Boolean) using different methods [4 ]. The goal is to find a model 

f={fufz,...,fn):n—>®2 

such that, for j = 1 , . . . , r, 

f( S j) = (f l ( S j),...,f n ( Sj ))=tj. 

Notice that, since any function over a finite field is a polynomial, each ft is a polynomial. An algorithm 
that finds all models / is presented in lfl4l . This is done by identifying, for each gene i, the set of all 
possible functions for f\. This set can be represented as the coset f + I, where / is a particular such 
function and / C F2[jci, . . . ,x n ] is the ideal of all Boolean polynomials that vanish on the input data set, 
that is, / = I({si , . . . , s r }). The algorithm in lfl4l then proceeds to find a particular model from the model 
space / + /. The chosen model, which is the normal form of / in the ideal /, depends on the term 
ordering used in the Grobner bases computation. So different ordering of the variables (genes) might 
lead to the selection of different models. This presents a problem as the term ordering, which is a needed 
for computational reasons, clearly influence the model selection process. 

Several modifications have since been presented to address this problem. For example, in 0, using 
the Grobner fan of the ideal /, the authors developed a method that produces a probabilistic model using 
all possible normal forms. Other improvements on this algorithm can be found in I9ll23l. 

Another approach toward improving the model selection process is by restricting the model space 
/ + / by requiring not only that the chosen model fits the data but also satisfies some other conditions, 
such as its network being sparse or scale-free, the polynomials /; being monomials, the dynamics of the 
model having some desirable properties such as fixed points are the only limit cycles (that is, starting 
from any initialization, the model always reaches a steady state), or that the model is robust and stable 
which could roughly mean that the number of attractors in the phase space is small. In a nutshell, some 
but not all functions in the model space / + / are biologically relevant and hence restricting the space to 
only relevant models will improve the model selection process. 

By desiring a particular property, several classes of functions have been proposed as biologically 
relevant functions such as biologically meaningful rules fT9l , certain post classes of Boolean functions 
have been studied in |[20l . and chain functions in Q, to name few. Another class of Boolean functions, 
which was introduced by S. Kauffman et al. Ill 21 . is called (nested) canalyzing functions (NCF), where 
an input to a single variable exclusively determines the value of the function regardless of the values of 
all other variables. This is a natural characterization of "canalisation" which was introduced by geneticist 
C. H. Waddington [26 ] to represent the ability of a genotype to produce the same phenotype regardless of 
environmental variability. Indeed, known biological functions have been shown to be canalyzing ll6llT7l. 
and Boolean nested canalyzing networks to be robust and stable iTTTl fT2l ITTt . 

For the purpose of restricting the model space / + / of all Boolean polynomial models to NCFs 
only, we previously studied nested canalyzing functions, gave necessary and sufficient conditions on the 
coefficients of a boolean polynomial function to be nested canalyzing, and showed that NCFs are nothing 
but unate cascade functions iflOl . Furthermore, in QQ, the class of all nested canalyzing functions is 
parameterized as the rational points of an affine algebraic variety over the algebraic closure of F2. This 
variety was shown to be toric, that is, defined by a collection of binomial polynomial equations. In 
this paper we present an algorithm that restricts the model space to only nested canalyzing functions by 
identifying all NCFs from the model space / + / that fit the given data set. 

In the next section we briefly recall some definitions and results from IflOl [H. Our algorithm is 
presented in Section|3l and its implementation in Singular is discussed in SectionlU Before we conclude 
this paper, we demonstrate the algorithm in Section [5l where we identify all nested canalyzing models 
for the cell cycle in budding yeast using time course data from the Boolean model in lfI51 . 
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2 Nested Canalyzing Functions: Background 

We recall some of the definitions and the results from lfT0l l8l that we need to make this presentation self- 
contained. Throughout this paper, when we refer to a function on n variables, we mean that h depends on 
all n variables, that is, for i = 1, . .. ,n, there exists (a\, . . . ,a n ) E F'j such that h(a\,. . . ,a,-_i, a { , a i+ i, . . . ,a n ) ^ 
h(a\ 1 +ai,ai + \ , . . . ,a n ). 

Definition 2.1. Let h be a Boolean function on n variables, i.e., h : FJJ — > F 2 . 

• The function h is a nested canalyzing function (NCF) with respect to a permutation a on the n 
variables, canalyzing input value a,- and canalyzed output value bj, for i = 1,. . . ,n, if it can be 
represented in the form 



/i(xi,x 2 , ••-,*«) 



61 if X fT ( 1 ) = £?i, 

6 2 if ^ a\ and x ct(2 ) = a 2 , 

b 3 ifJC CT (i) 7^1 andx CT ( 2 ) /a 2 andx f7 ( 3 ) =a 3 , 

fc„ if 7^ «i and • • • and x a (n-\) 7^ and ^a(n) = ««> 
if ■%(!) 7^ «i and ■ • • and x a ^ ^ a n . 



(1) 



• The function h is nested canalyzing if h is nested canalyzing with respect to some permutation a, 
canalyzing input values a\,...,a n and canalyzed output values bi,...,b n , respectively. 

Remark 2.2. The definition above has been generalized to multistate functions in |fT6ll . where it is also 
shown that the dynamics of these functions are similar to their Boolean counterparts. In |[T8l . the authors 
introduce what they called kinetic models with unate structure, which are continuous models having the 
canalization property, and they presented an algorithm for identifying such models. 

Using the polynomial form of any Boolean function, the ring of Boolean functions is isomorphic 
to the quotient ring R = W%[xi,. . . ,x n ]/J, where / = {xf — x ; - : 1 < i < n). Indexing monomials by the 
subsets of [n] := {1, . . . ,n} corresponding to the variables appearing in the monomial, the elements of R 
can be written as 

SC[n] ieS 

As a vector space over F 2 , R is isomoiphic to W\ via the correspondence 

RB E c s f[x f <— ►(«*,. ...c^EFf. (2) 

SQn] i€S 

The main result in ifTUll is the identification of the set of nested canalyzing functions in R with a 
subset V" c f of F2" by imposing relations on the coordinates of its elements. 

Definition 2.3. Let a be a permutation of the elements of the set [n]. We define a new order relation < a 
on the elements of [n] as follows: a(i) < a a(j) if and only if i < j. Let r° be the maximum element of a 
nonempty subset S of [n] with respect to the order relation < a . For any nonempty subset 5 of [n], the com- 
pletion ofS with respect to the permutation a, denoted by [rj ], is the set [r£ ] = |cr(l), cj(2), . . . , o(r°)}. 

Note that, if a is the identity permutation, then the completion is [r$] := {1,2, . . . , rs}, where is the 
largest element of S. 
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Theorem 2.4. Let h £R and let o be a permutation of the set [n]. The polynomial h is nested canalyzing 
with respect to O, input value a± and corresponding output value b[,for i= 1, . . . ,n, if and only ifcuj = 1 
and, for any proper subset S C [n], 

c s = c V° s \ II c M\M0}- ( 3 ) 
«r(0e[if]\s 

Corollary 2.5. The set of points in Fj" corresponding to the set of all nested canalyzing functions with 
respect to a permutation a on [n], denoted by , is defined by 

Vff C/ = {(c0,...,C[„]) eFf :c[„j = l,c s = c [r j] Y[ c [n}\{a(i)}Jor S C [«]}. (4) 

<r(i)e[^]\S 

It was shown in (8] that is an algebraic variety, and its ideal I(V% C ^ ) is a binomial prime ideal 
in the polynomial ring F2[{cs : S Q [«]}], where F2 is the algebraic closure of F2. Namely, 

la = I(Vff C/ ) = (c[ n ] ~ 1 5 c s - c [r aj c W \ {fT(; - )} : 5 C [n]). 

<7(i)e[rJ]\S 

Furthermore, the variety of all nested canalyzing functions is 

v nc f = \Jv™ f 

a 

and its ideal is 

I(y-/) = P|/ ff . 

In the next section, we identify the set / + / with the rational points in an algebraic affine variety. 
This will allow us to identify all nested canalyzing functions in the model space / + /. 



3 Nested Canalyzing Models 

Recall that we are given the data set D = {(si,ti), . . . , (s n t r )} cFjX FJ[. The model space could be 
presented by the set / + / where, / = {f\ , . . . , f n ) and, for i = 1 , . . . , n, 

r n 

f(x u ...,X n ) = J^i-JIv 1 - (*e-Sj,e)). (5) 
j=l e=l 

In particular, f is a polynomial that interpolates the data for gene i and / is the ideal of points of 
{si, . . . ,s r }. Furthermore, the ideal / is a principal ideal in the ring R/J: 

I = I({si,...,s r }) (6) 

= hi(fa}) < 7 > 

r 

= f){xi-SjA,...,X n -Sj;,) (8) 

M 

= n<i-fi(i-(^-^))> (9) 
= (ri(i-n(i-(**-^)))>- do) 

j=\ e=l 
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Now a polynomial hEfi +1 if and only if h = f +g(x\,. . . ,JCn)Tlj=i (1 ~~ 11"= 1 (1 ~~ { x e~ Sj.e))), for some 
polynomial g, say g = Y,HC[n] ^jylLetf*/* By expanding the right-hand side and collecting terms, we get 
that h = YiSC[n]^s(bii,Sj,tj)Yli eS xi, where, for S C [n], the coefficient Ws(bji,Sj,tj) is determined by 
bn,Sj,tj for all // C [n] and 7 = 1,. . . ,r. 

The proof of the following theorem follows directly from Theorem 2.4.2 in JT]. 

Theorem 3.1. Consider the ring homomorphism 

<J> : F 2 [{C5 : S C [«]}] — > F 2 [{o ff : // C [n]}] 

g/ve« oy, /or 5 C [n], 

c s i-> W s (b H ,Sj,tj). 

Then ker(<J>) is the ideal of all polynomials that fit the data set D. In particular, the rational points in the 
variety V(ker(<J>)) is the set of all models that fit the data set D, namely f + I. 

Since the ideal of all NCFs is \(y nc f ), the following corollary is straightforward. 

Corollary 3.2. The ideal of all nested canalyzing functions that fit the data set D is I(V nc f) +ker(<I>). 

Remark 3.3. It is clear that the model space of Boolean functions is huge, since the number of monomials 
grows exponentially in the number of variables. For example, if a function has 5 inputs, there are 2 5 = 32 
different monomials in 5 variables, and hence 2 32 = 4,294,967,296 different Boolean functions. This 
clearly shows that a search for NCFs inside the model space is computationally not feasible, which 
justifies the need for algorithms like the one above. 

4 Algorithm 

In this section we present an algorithm for identifying all nested canalyzing models from the model space 
of a given data set. 

Input 

A wiring diagram, i.e., a square matrix of dimension n, describing the influence relationships among the n 
genes in the network. For each variable*,, a table consisting of the rows (sjj [ , . . . ,Sjj s ,tjj), j = {1, . . .,r}, 
where h,..., i s are the indices of the genes that affect jc,-, as specified in the wiring diagram. 

Output 

For each variable, the complete list of all nested canalyzing functions interpolating the given data set on 
the given wiring diagram. A function is in the output if it is nested canalyzing in at least one variable 
order. If needed, the code can easily be modified to find only nested canalyzing functions of a particular 
variable order. 

Algorithm 

It is a well known fact, that a Grobner basis for the kernel of <I> is a basis for (cs — Ws : S C [n]}) 
intersected with the ring F2 [{cs : 5 C [n]}] [I] Theorem 2.4.2] Using a similar notation as above, the 
algorithm is outlined as follows: 
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Use ring ¥ 2 [xi,...,x„,b s ,c s : S C [«]}] 
Define I(V" c f) as ideal in F 2 [c 5 : 5 C [«]}] 
Define h = Z H c[n] t>HUieHXi 
Define q = Zhc[h] CH~[lieH x i 

Compute the polynomial p that generates / as in ([TOl 
For each variable x { do 

1 . Compute as in © 

2. Let g = + /j * its coefficients are the same as W$ above 

3. Compute a Grobner basis G for the ideal generated by the coefficients of g — q using any 
elimination order to eliminate all b$ from G 

4. Concatenate generators of G and I(y nc ^ ) 

5. Compute the primary decomposition of G + I(V nc ^ ) to obtain necessary and sufficient conditions 
on the coefficients of all NCFs fitting the data set D 

End 

An implementation of this algorithm is available as a Singular library QUI. 

5 Application: Inferring the Cell Cycle Network in Budding Yeast 

Li et al. lfl5l constructed a Boolean threshold model of the cell cycle in budding yeast. The network of 
the model has the key known regulators of the cell cycle process and the known interactions among these 
regulators in the literature. The Boolean function at each node, however, is a threshold function, which 
is completely determined by the numbers of active activators and active inhibitors, and is not necessarily 
biologically motivated. However, this model captures the known features of the global dynamics of the 
cell cycle, it is robust and stable, and the trajectory of the known cell-cycle sequence is a stable and 
attracting trajectory as it has 1764 states out of the total number of 2048 states. The remaining states are 
distributed into 6 small trajectories. 

In this Section, we use the time course corresponding to the biological cell-cycle sequence, see Table 
[U to infer nested canalyzing models of the cell cycle. That is, assuming the same wiring diagram as 
the threshold model above, we use our algorithm to identify, for each gene in the network, all nested 
canalyzing functions that fit the cell cycle sequence. 

We start by describing the network. It consists of 1 1 proteins and a start signal. The proteins are 
members of the following three classes: cyclins (Clnl,2, Cln3, Clbl,2, and Clb5,6), inhibitors, degraders, 
and competitors of cyclin complexes (Sicl, Cdhl, Cdc20 and Cdcl4), and transcription factors (SBF, 
MBF, Swi5, and Mcml/SFF). This simplified network (Figured]) is almost identical to the network in 
|[T5l where the only difference is that we do not force self-degradation, as it was added to some nodes in 
the network because they did not have inhibitors, but without biological justification fT3| . Furthermore, 
we do not impose activation or inhibition in the network. As we do not use threshold functions but 
more general boolean functions, a variable can both increase and decrease the concentration of another 
substrate, depending on the concentrations of other proteins. 

Li et al. jl5l use their model to generate a time course of temporal evolution of the cell-cycle 
network, shown in Table Q] This time course is in agreement with the behavior of the cell-division 
process cycling through the four distinct phases G\, S (Synthesis), G2, and M (Mitosis). 

We used this time course along with the network as the only input to the algorithm to obtain all nested 
canalyzing functions that interpolate the time course. In the fifth column of Table|2]we list the number of 
NCFs for each protein. By requiring the Boolean function to be nested canalyzing, we have significantly 
reduced the number of possible functions for each protein as it is evident when comparing the numbers in 
the third and fifth columns of Table|2] However, even after this reduction, there are 330, 559, 488 possible 
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Figure 1: The simplified cell cycle network in budding yeast, which is based on the model in fl31 . 
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Table 1: The temporal evolution of the Boolean cell-cycle model in [151; corresponding to the biological 
cell-cycle sequence. 



nested canalyzing models that fit the time course in Table [TJ To reduce the this number more, one needs 
to use additional time courses or request that the models incorporate additional biological information 
about yeast cell cycle. 
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Table 2: For each protein i, we list the number of inputs, the number of possible Boolean functions (the 
cardinality of f + 1), the number of nested canalyzing functions with the given number of inputs, and 
finally the number of nested canalyzing functions in the model space fi + 1. 

5.1 Dynamics 

To analyze the dynamics of the resulting nested canalyzing models, we randomly sampled 2000 models 
and analyzed them. The average number of basins of attraction (components) per network is 3.09, and 
the average size of the component containing the given trajectory is 1889. In only 6 models, the trajectory 
in Table Q] is not in the largest component, however, the average size of the component containing the 
trajectory is 833.5. 

These results clearly show that nested canalyzing models for the cell cycle network are in agreement 
with the original threshold model of Li et al. , and since such models are known to be robust and stable, 
any of these models could be used as a model for the cell cycle in budding yeast. Furthermore, especially 
when there is no evidence for choosing a particular type of functions, a nested canalyzing function has 
an advantage over other possible choices. 

5.2 Comparison with Random Networks 

To understand the effect of the network itself on its dynamics, we sampled 2000 models on the same 
network where the local function of each gene in each one of these models is chosen randomly from all 
possible functions in the model space. We found that the given cell cycle trajectory has oftentimes much 
smaller basin of attraction, and hence random functions on the cell cycle network could not in general 
produce the desired dynamics. A comparison of the statistics from the sampled networks is shown in 
Figures |2] and [3] 

6 Conclusion 

In this paper we have presented an algorithm for identifying all Boolean nested canalyzing models that 
fit a given time course or other input-output data sets. Our algorithm uses methods from computational 
algebra to present the model space as an algebraic variety. The intersection of this variety with the 
variety of all NCFs, which was parameterized in [8], gives us the set of all NCFs that fit the data. We 
demonstrate our algorithm by finding all nested canalyzing models of the cell cycle network form Li et 
al. lfl5l . We then showed that the dynamics of almost any of these models is strikingly similar to that 
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Figure 2: Nested canalyzing functions with the wiring diagram in Figure Q] interpolating the time course 
in Tabled] x-axis: size of basin of attraction for given trajectory, y-axis: number of networks observed, 
out of 2000. 
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Figure 3: Not nested canalyzing functions with the wiring diagram in Figure Q] interpolating the time 
course in Table [TJ ;t-axis: size of basin of attraction for given trajectory, y-axis: number of networks 
observed, out of 2000. 



of the original threshold model. Unless the chosen model is required to meet other conditions, and in 
that case the model space will be reduced further, any one of the models that our algorithm found is an 
acceptable model of the cell cycle process in the Budding yeast. 

One limitation of the current algorithm, which we left for future work, is that it does not distinguish 
between activation and inhibition in the network as we do not have a systematic method of knowing 
when a given variable in a (nested canalyzing) polynomial is an activator or inhibitor. 

As our algorithm relies heavily on different Grobner based computations, the current implementation 
in Singular allows a given gene to have at most 5 regulators. This is due to the fact that the number of 
monomials then is 32 which is already a burden especially when the primary decomposition of an ideal 
is what we are after. We are working on a better implementation so that we can infer larger and denser 
networks. 
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